{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c54486e0",
   "metadata": {},
   "source": [
    "<p> SPDX-FileCopyrightText: Copyright (c) <2024> NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n",
    " SPDX-License-Identifier: Apache-2.0\n",
    "\n",
    " Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    " you may not use this file except in compliance with the License.\n",
    " You may obtain a copy of the License at\n",
    "\n",
    " http://www.apache.org/licenses/LICENSE-2.0\n",
    "\n",
    " Unless required by applicable law or agreed to in writing, software\n",
    " distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    " WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    " See the License for the specific language governing permissions and\n",
    " limitations under the License.\n",
    " </p>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae478601-2c1b-4265-85ba-5349afad48fd",
   "metadata": {},
   "source": [
    "# Quantization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "904e78d7-f126-4869-a7ff-633861def3ae",
   "metadata": {},
   "source": [
    "**Quantization** is a technique used to reduce the size of a model's weights, thereby decreasing memory usage and potentially increasing the speed of the model. It achieves this by converting weights from a high-precision format, such as 32-bit floating-point, to a lower-precision format, such as 4-bit or 8-bit integers. This can significantly reduce the model's size and enhance its prediction speed.\r\n",
    "\r\n",
    "The quantization process involves the following steps:\r\n",
    "\r\n",
    "- **Loading a model checkpoint:** Utilize a suitable parallelism strategy to load the model.\r\n",
    "- **Calibrating the model:** Determine the appropriate algorithm-specific scaling factors.\r\n",
    "- **Outputting the model:** Generate an output directory with the quantized model."
   ]
  },
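  {
   "cell_type": "markdown",
   "id": "sketch-int8-round-trip",
   "metadata": {},
   "source": [
    "To make the conversion to a lower-precision format concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration only, not the algorithm ModelOpt applies; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.\n",
    "\n",
    "```python\n",
    "# Symmetric per-tensor quantization: map floats to integers in [-127, 127].\n",
    "def quantize_int8(weights):\n",
    "    max_abs = max(abs(w) for w in weights)\n",
    "    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.\n",
    "    scale = max_abs / 127 if max_abs > 0 else 1.0\n",
    "    quantized = [max(-127, min(127, round(w / scale))) for w in weights]\n",
    "    return quantized, scale\n",
    "\n",
    "# Map the integers back to approximate float values.\n",
    "def dequantize(quantized, scale):\n",
    "    return [q * scale for q in quantized]\n",
    "\n",
    "weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # hypothetical weights\n",
    "quantized, scale = quantize_int8(weights)\n",
    "restored = dequantize(quantized, scale)\n",
    "max_error = max(abs(w - r) for w, r in zip(weights, restored))\n",
    "print(quantized)\n",
    "print(f'scale={scale:.4f}, max round-trip error={max_error:.6f}')\n",
    "```\n",
    "\n",
    "Each restored weight differs from the original by at most half a quantization step (`scale / 2`), so a smaller scale means a smaller rounding error."
   ]
  },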
  {
   "cell_type": "markdown",
   "id": "c27e29dd-503a-4c9b-aa69-2d40bb5b9b0f",
   "metadata": {},
   "source": [
    "### Imports and Dependencies\n",
    "Begin by importing the required libraries and dependencies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "465a1b79-5088-4162-841c-76493db64a3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dependencies\n",
    "import torch\n",
    "import os\n",
    "import time\n",
    "import torch.nn as nn\n",
    "import modelopt.torch.quantization as mtq\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "from modelopt.torch.export import export_tensorrt_llm_checkpoint\n",
    "from peft import  PeftModel \n",
    "from datasets import load_dataset\n",
    "from torch.utils.data import DataLoader"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "72b011fa-1288-4b94-b317-a4719d4e3e4e",
   "metadata": {},
   "source": [
    "#### Constants:\n",
    "- **Hugging Face Access Token**: Import the value of the Hugging Face Access Token from the environment variables or a separate file.\n",
    "- **Model ID**: Add the ID of the base model to the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "051645b8-0b7d-4f76-8fbd-5559c1e85483",
   "metadata": {},
   "outputs": [],
   "source": [
    "hf_token= os.environ.get(\"HF_TOKEN\")\n",
    "# Change model_id\n",
    "model_id = \"meta-llama/Meta-Llama-3-8B-Instruct\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2208de33-7c8d-4809-8c89-cd609297060f",
   "metadata": {},
   "source": [
    "### Quantization\n",
    "Post Training Quantization(PTQ) enables deploying a model in a low-precision format – FP8, INT4, or INT8 – for efficient serving. Different quantization methods are available including FP8 quantization, INT8 SmoothQuant, and INT4 AWQ.\n",
    "\n",
    "Now we will be define the path of recently merged model and also selects the appropriate device based on the hardware available, allowing for seamless execution on both GPU and CPU.\n",
    "\n",
    "**Ensure the correct merged checkpoint path is specified below.**"
   ]
  },
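  {
   "cell_type": "markdown",
   "id": "sketch-weight-memory",
   "metadata": {},
   "source": [
    "As a rough illustration of why low-precision formats help with serving, the sketch below estimates the storage needed for the weights alone at different precisions. The helper `weight_bytes` is hypothetical, and the parameter count is an assumed nominal value for an 8B-parameter model; scaling factors, activations, and the KV cache are ignored.\n",
    "\n",
    "```python\n",
    "# Bytes needed to store the weights alone at a given precision.\n",
    "def weight_bytes(num_params, bits_per_weight):\n",
    "    return num_params * bits_per_weight // 8\n",
    "\n",
    "num_params = 8_000_000_000  # assumption: nominal size of an 8B-parameter model\n",
    "for name, bits in [('FP16', 16), ('FP8 / INT8', 8), ('INT4', 4)]:\n",
    "    gib = weight_bytes(num_params, bits) / 2**30\n",
    "    print(f'{name:<10} {gib:6.1f} GiB')\n",
    "```\n",
    "\n",
    "Under these assumptions, INT4 weights take a quarter of the space of FP16 weights."
   ]
  },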
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80014d11-6b29-4985-a082-5997b7b7df29",
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_model= \"/project/data/scratch/merged\"\n",
    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c73adeb-e538-4c6a-9fa2-27275e7ed602",
   "metadata": {},
   "source": [
    "We are using INT4 AWQ Method. Activation-aware Weight Quantization (AWQ), is a technique for compressing and accelerating Large Language Models (LLMs) by reducing the precision of the model weights. AWQ focuses on low-bit, weight-only quantization."
   ]
  },
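  {
   "cell_type": "markdown",
   "id": "sketch-int4-groups",
   "metadata": {},
   "source": [
    "The following is a minimal sketch of low-bit, weight-only quantization with one scale per group of weights, written in plain Python. It is not ModelOpt's AWQ implementation: the activation-aware scaling that AWQ adds is omitted, and the function names, group size, and sample weights are hypothetical.\n",
    "\n",
    "```python\n",
    "# Quantize a flat list of weights to 4-bit integers in [-8, 7], one scale per group.\n",
    "def quantize_int4_groups(weights, group_size=4):\n",
    "    quantized, scales = [], []\n",
    "    for start in range(0, len(weights), group_size):\n",
    "        group = weights[start:start + group_size]\n",
    "        max_abs = max(abs(w) for w in group)\n",
    "        scale = max_abs / 7 if max_abs > 0 else 1.0\n",
    "        scales.append(scale)\n",
    "        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)\n",
    "    return quantized, scales\n",
    "\n",
    "# Rebuild approximate float weights from the integers and the per-group scales.\n",
    "def dequantize_groups(quantized, scales, group_size=4):\n",
    "    return [q * scales[i // group_size] for i, q in enumerate(quantized)]\n",
    "\n",
    "weights = [0.02, -0.01, 0.03, 0.07, 1.4, -0.6, 0.4, 0.0]  # hypothetical weights\n",
    "grouped, group_scales = quantize_int4_groups(weights, group_size=4)\n",
    "single, single_scale = quantize_int4_groups(weights, group_size=len(weights))\n",
    "print('per-group scales:', grouped)\n",
    "print('single scale:    ', single)\n",
    "```\n",
    "\n",
    "With a single scale, the four small weights all round to zero; with one scale per group they keep distinct values. This is one reason low-bit schemes use fine-grained scaling factors."
   ]
  },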
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35c889ab-fc0f-410c-a1fa-64f3cb657eaf",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select the quantization config\n",
    "config = mtq.INT4_AWQ_CFG\n",
    "\n",
    "calib_size=32\n",
    "block_size=512\n",
    "batch_size=1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7087b706-4dfe-49d5-92d8-fc9adeea2f76",
   "metadata": {},
   "source": [
    "Now, we will load the tokenizer and the recently merged model. After loading, we will calibrate the model. The calibration process in Post-Training Quantization (PTQ) is a crucial step that involves adjusting the quantization parameters of a model to minimize the loss of accuracy that typically occurs when the model's numerical precision is reduced."
   ]
  },
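  {
   "cell_type": "markdown",
   "id": "sketch-absmax-calibration",
   "metadata": {},
   "source": [
    "As a simplified picture of what collecting calibration statistics means, the sketch below tracks the largest absolute value seen across a few batches and derives a scaling factor from it. This abs-max approach is only one simple calibration strategy and is not what `mtq.quantize` does internally for AWQ; the class `AbsMaxCalibrator` and the sample values are hypothetical.\n",
    "\n",
    "```python\n",
    "# Tracks the largest absolute activation value seen during calibration.\n",
    "class AbsMaxCalibrator:\n",
    "    def __init__(self):\n",
    "        self.amax = 0.0\n",
    "\n",
    "    def observe(self, activations):\n",
    "        self.amax = max(self.amax, max(abs(a) for a in activations))\n",
    "\n",
    "    def scale(self, num_bits=8):\n",
    "        qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits\n",
    "        return self.amax / qmax if self.amax > 0 else 1.0\n",
    "\n",
    "# Hypothetical activations recorded for three calibration batches.\n",
    "calibration_batches = [[0.1, -0.4, 0.25], [2.54, -1.0, 0.3], [0.9, -0.2, 0.0]]\n",
    "\n",
    "calibrator = AbsMaxCalibrator()\n",
    "for batch in calibration_batches:\n",
    "    calibrator.observe(batch)\n",
    "\n",
    "print(f'amax={calibrator.amax}, scale={calibrator.scale():.4f}')\n",
    "```\n",
    "\n",
    "The more representative the calibration data is of real inputs, the better the resulting scaling factor fits the values seen at inference time."
   ]
  },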
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e41ee45-897d-49c1-b816-03015d2eb247",
   "metadata": {},
   "outputs": [],
   "source": [
    "# The forward loop is used to pass data through the model in-order to collect statistics for calibration. \n",
    "# It should wrap around the calibration dataloader and the model.\n",
    "def calibrate_loop(model):\n",
    "\t\"\"\"Adjusts weights and scaling factors based on selected algorithms.\"\"\"\n",
    "\tfor idx, data in enumerate(calib_dataloader):\n",
    "\t\tprint(f\"Calibrating batch {idx}\")\n",
    "\t\tmodel(data)\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=merged_model)\n",
    "model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=merged_model, torch_dtype=torch.float16, device_map=device)\n",
    "\n",
    "tokenizer.pad_token = tokenizer.eos_token\n",
    "\n",
    "# Prepare calibration data\n",
    "dataset2 = load_dataset(\"cnn_dailymail\", name=\"3.0.0\", split=\"train\").select(range(512))\n",
    "dataset2 = dataset2[\"article\"][:calib_size]\n",
    "batch_encoded = tokenizer.batch_encode_plus(dataset2, return_tensors=\"pt\", padding=True, truncation=True, max_length=block_size)\n",
    "batch_encoded = batch_encoded.to(device)\n",
    "batch_encoded = batch_encoded[\"input_ids\"]\n",
    "calib_dataloader = DataLoader(batch_encoded, batch_size=batch_size, shuffle=False)\n",
    "\n",
    "# PTQ with in-place replacement to quantized modules\n",
    "with torch.no_grad():\n",
    "\tprint(\"starting quantization (mtq.quantize API call)...\")\n",
    "\tstart_time = time.time()\n",
    "\tmodel=mtq.quantize(model, config, calibrate_loop)\n",
    "\tend_time = time.time()\n",
    "\tprint(f\"done, time taken = {end_time - start_time} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba93ee30-b803-4934-b5cd-5054347f3a4d",
   "metadata": {},
   "source": [
    "### 3. Exporting Quantized model\n",
    "As model is quantized now, it can be exported to a TensorRT-LLM checkpoint, which includes\n",
    "\n",
    "- One json file recording the model structure and metadata, and\n",
    "- One or several rank weight files storing quantized model weights and scaling factors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b1b3cfd-af82-465c-8416-e0ba888134e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "export_dir = \"/project/data/scratch/merged-int4\" #change checkpoint directory\n",
    "decoder_type = \"llama\" #change decoder_type according to the supported model\n",
    "inference_tensor_parallel = 1\n",
    "inference_pipeline_parallel = 1\n",
    "\n",
    "# Export Quantized Model\n",
    "with torch.inference_mode():\n",
    "    export_tensorrt_llm_checkpoint(\n",
    "        model,\n",
    "        decoder_type,\n",
    "        torch.float16,\n",
    "        export_dir,\n",
    "        inference_tensor_parallel,\n",
    "        inference_pipeline_parallel\n",
    "    )\n",
    "\n",
    "print(\"\\ntrt-llm checkpoint export done\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24ce4b03-36e1-42f8-bf2b-a23f396df996",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
